Here I use scikit-learn to perform PCA in a Jupyter Notebook.
First, I need an example to get familiar with it.
Get our data and analyze it
In [ ]:
import numpy as np
from sklearn.decomposition import PCA
import pandas as pd
In [ ]:
df = pd.read_csv('London.txt', sep=r'\s+')  # raw string avoids an invalid-escape warning
# df.drop('id', axis=1, inplace=True)  # Unlike the Manhattan data, the id column was already removed earlier
df.tail()
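A minimal sketch of how `sep=r'\s+'` splits columns on runs of whitespace. The in-memory sample and the names `sample` and `demo` are made up for illustration; they are not the contents of London.txt.

```python
import io
import pandas as pd

# Hypothetical sample standing in for a whitespace-delimited file
sample = io.StringIO(
    "name   a     b\n"
    "x      1.0   2.5\n"
    "y      3.0   4.5\n"
)

# Any run of spaces or tabs is treated as a single delimiter
demo = pd.read_csv(sample, sep=r'\s+')
print(demo.shape)
print(list(demo.columns))
```

Uneven column padding in the file does not matter here, since the separator is a regular expression rather than a single character.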
How to index a given part of a DataFrame has been a problem for me.
Refer to pandas/html/10min.html#selection-by-position
to keep it in mind (links to files outside this directory do not work well):
file:///C:/work/python/%E6%96%87%E6%A1%A3/pandas/html/10min.html#selection-by-position
In [ ]:
tdf = df.iloc[:, 1:-2]
tdf.tail()
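As a reminder of what `iloc[:, 1:-2]` selects, here is a small sketch on a made-up frame. The column names are assumptions for illustration only and do not reflect the real London data.

```python
import pandas as pd

# Hypothetical frame: one label column, three features, two trailing columns
demo = pd.DataFrame({
    'name': ['a', 'b'],
    'f1': [1.0, 2.0],
    'f2': [3.0, 4.0],
    'f3': [5.0, 6.0],
    'meta1': [0, 1],
    'meta2': [1, 0],
})

# All rows; columns from position 1 up to (not including) the second-to-last
features = demo.iloc[:, 1:-2]
print(list(features.columns))
```

Position-based slicing follows the usual Python rule: the start is included and the stop is excluded, so the first column and the last two columns are dropped.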
Take one principal component; explained variance 0.917864
In [ ]:
pca = PCA(n_components=8)
pca.fit(tdf)
np.set_printoptions(precision=6, suppress=True)
print('Explained variance ratio of each principal component:', end=' ')
print(pca.explained_variance_ratio_)
emotion_score = pd.DataFrame(pca.transform(tdf))
# Append the first principal component to the original data as emotion_score
scored = pd.concat([df, emotion_score.loc[:, 0]], axis=1, join='inner').rename(columns={0: 'emotion_score'})
scored.to_csv('London_score_raw.txt', index=False, sep='\t')
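To check my understanding of what the first-component score means, here is a rough sketch on synthetic data (not the London data). The array `X` and the names `pca_demo`, `ratios`, `scores` and `manual` are made up for this example. It shows that the explained variance ratios add up to 1 when all components are kept, and that the first column of `transform` is the centered data projected onto the first component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 4 features driven mostly by one latent factor
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[1.0, 0.8, 0.6, 0.4]]) + 0.1 * rng.normal(size=(200, 4))

pca_demo = PCA(n_components=4).fit(X)
ratios = pca_demo.explained_variance_ratio_
cumulative = np.cumsum(ratios)  # running total of explained variance

# Score on the first component: centered data projected onto components_[0]
scores = pca_demo.transform(X)[:, 0]
manual = (X - pca_demo.mean_) @ pca_demo.components_[0]

print(ratios)
print(cumulative)
print(np.allclose(scores, manual))
```

When one component already carries most of the variance, as in this synthetic case, keeping only that component as a single score loses little information.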